Data management in R

Gerko Vink

Methodology & Statistics @ Utrecht University

2 Jun 2025

Disclaimer

I owe a debt of gratitude to many people as the thoughts and code in these slides are the process of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

These materials are generated by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Recap

Yesterday we have learned:

  1. How to use R and RStudio
  2. How to install packages
  3. How to use simple data containers
  4. How to do subsetting in base R with [ ]
  5. How to use logical operators to subset data

Today

Today we will learn how to:

This lecture

  • The blueprint of R
  • Code conventions en style
  • Basic data manipulation
  • Basic analysis (correlation & t-test)
  • R model formula and classes
  • Pipes

New packages we use

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes

New functions

  • transform(): changing and adding columns
  • dplyr::filter(): row-wise selection (of cases)
  • dplyr::arrange(): order rows by values of a column or columns (low to high)
  • table(): frequency tables
  • class(): object class
  • levels(): levels of a factor
  • order(): data entries in increasing order
  • haven::read_sav(): import SPSS data
  • cor(): bivariate correlation
  • sample(): drawing a sample
  • t.test(): t-test

The blueprint of R

Layers in R

There are several ‘layers’ in R. Some layers you are allowed to fiddle around in, some are forbidden. In general there is the following distinction:

  • The global environment.
  • User environments
  • Functions
  • Packages
  • Namespaces

Environments

The global environment can be seen as an olympic-size swimming pool. Everything you do has its place there.

If you’d like, you may create another, separate environment to work in.

  • A user environment would by default not have access to other environments

Functions

  • If you create a function, it is positioned in the global environment.

  • Everything that happens in a function, stays in a function. Unless you specifically tell the function to share the information with the global environment.

  • See functions as a shampoo bottle in a swimming pool to which you add some water. If you’d like to see the color of the mixture, you’d have to squeeze the bottle for it to come out.

Packages

  • Packages have their own space.

    • Everything needed to run the functions in a package is needly contained within its own space
    • See packages as separate (mini) pools that are connected to the main pool (the global environment)

Loading packages

There are two ways to load a package in R

library(stats)

and

require(stats)

require() will produce a warning when a package is not found. In other words, it will not stop as function library() does.

Installing packages

The easiest way to install e.g. package mice is to use

install.packages("mice")

Alternatively, you can also do it in RStudio through

Tools --> Install Packages

Namespaces

  • Namespaces. These are the deeper layers that feed new water to the surface of the mini pools.
    • Packages can have namespaces.
    • Functions within packages are executed within the package or namespace and have access to the global environment.
    • Objects in the global environment that match objects in the function’s namespace are ignored when running functions from packages!

R in depth

Workspaces and why you should sometimes save them

A workspace contains all changes you made to environments, functions and namespaces.

A saved workspace contains everything at the time of the state wherein it was saved.

You do not need to run all the previous code again if you would like to continue working at a later time.

  • You can save the workspace and continue exactly where you left.

Workspaces are compressed and require relatively little memory when stored. The compression is very efficient and beats reloading large datasets from raw text.

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

Most often it may be useful to look back at the code history for various reasons.

  • There are multiple ways to access the code history.

    1. Use arrow up in the console. This allows you to go back in time, one codeline by one. Extremely useful to go back to previous lines for minor alterations to the code.
    2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.

Working in projects in RStudio

  • Every project has its own history
  • Every research project has its own project
  • Every project can have its own folder, which also serves as a research archive
  • Every project can have its own version control system
  • R-studio projects can relate to Git (or other online) repositories

Modeling

Modeling in R

To model objects based on other objects, we use ~ (tilde)

For example, to model body mass index (BMI) on weight, we would type

BMI ~ weight

The ~ is used to separate the left- and right-hand sides in a model formula.

For functions (or models), within models we use I() - For example, to model body mass index (BMI) on its deterministic function of weight and height, we would type

BMI ~ I(weight / height^2)

Example

Remember the boys data from package mice:

lm(bmi ~ wgt, data = boys)

Call:
lm(formula = bmi ~ wgt, data = boys)

Coefficients:
(Intercept)          wgt  
    14.5401       0.0935  

Example continued

Remember the boys data from package mice:

lm(bmi ~ I(wgt / (hgt / 100)^2), data = boys)

Call:
lm(formula = bmi ~ I(wgt/(hgt/100)^2), data = boys)

Coefficients:
       (Intercept)  I(wgt/(hgt/100)^2)  
         -0.005553            1.000034  

Why we really need I()

lm(bmi ~ wgt + wgt^2, data = boys)

Call:
lm(formula = bmi ~ wgt + wgt^2, data = boys)

Coefficients:
(Intercept)          wgt  
    14.5401       0.0935  
lm(bmi ~ wgt + I(wgt^2), data = boys)

Call:
lm(formula = bmi ~ wgt + I(wgt^2), data = boys)

Coefficients:
(Intercept)          wgt     I(wgt^2)  
  16.271466    -0.041715     0.001604  

More efficient

It is ‘nicer’ to store the output from the function in an object. The convention for regression models is an object called fit.

fit <- lm(bmi ~ I(wgt / (hgt / 100)^2), data = boys)

The object fit contains a lot more than just the regression weights. To inspect what is inside you can use

ls(fit)
 [1] "assign"        "call"          "coefficients"  "df.residual"  
 [5] "effects"       "fitted.values" "model"         "na.action"    
 [9] "qr"            "rank"          "residuals"     "terms"        
[13] "xlevels"      

Inspecting what is inside fit

Another approach to inspecting the contents of fit is the function attributes()

attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"        

$class
[1] "lm"

The benefit of using attributes() is that it directly tells you the class of the object.

Classes in R

class(fit)
[1] "lm"

Classes are used for an object-oriented style of programming. This means that you can write a specific function that - has fixed requirements with respect to the input. - presents output or graphs in a predefined manner.

When a generic function fun is applied to an object with class attribute c("first", "second"), the system searches for a function called fun.first and, if it finds it, applies it to the object.

If no such function is found, a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.

Classes example: plotting without class

plot(bmi ~ wgt, data = boys)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 1)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 2)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 3)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 4)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 5)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 6)

Why is plot different for class "lm"?

The function plot() is called, but not used. Instead, because the linear model has class "lm", R searches for the function plot.lm().

If function plot.lm() would not exist, R tries to apply function plot() (which would have failed in this case because plot requires x and y as input)

plot.lm() is created by John Maindonald and Martin Maechler. They thought it would be useful to have a standard plotting environment for objects with class "lm".

Since the elements that class "lm" returns are known, creating a generic function class is straightforward.

R-coding

The Google style guide

Naming conventions

File Names

File names should end in .R and, of course, be meaningful.

GOOD:

predict_ad_revenue.R

BAD:

foo.R

Identifiers

Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.

  1. The preferred form for variable names is all lower case letters and words separated with dots (variable.name), but variableName is also accepted;
  2. function names have initial capital letters and no dots (FunctionName);
  3. constants are named like functions but with an initial k.

Identifiers (continued)

  • variable.name is preferred, variableName is accepted
    GOOD: avg.clicks
    OK: avgClicks
    BAD: avg_Clicks

  • FunctionName
    GOOD: CalculateAvgClicks
    BAD: calculate_avg_clicks , calculateAvgClicks

  • kConstantName

Syntax

Line Length

The maximum line length is 80 characters.

# This is to demonstrate that at about eighty characters you would move off of the page

# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + wgt * hgt + wgt * hgt * bmi, data = boys)

# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt 
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
#or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg 
          + bmi * hgt 
          + bmi * wgt
          + wgt * hgt 
          + wgt * hgt * bmi, 
          data = boys)

Indentation

When indenting your code, use two spaces. RStudio does this for you!

Never use tabs or mix tabs and spaces.

Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.

Spacing

Place spaces around all binary operators (=, +, -, <-, etc.).

Exception: Spaces around =’s are optional when passing parameters in a function call.

lm(age ~ bmi, data=boys)

or

lm(age ~ bmi, data = boys)

Spacing (continued)

Do not place a space before a comma, but always place one after a comma.

GOOD:

tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])
total <- sum(x[, 1])
total <- sum(x[1, ])

Spacing (continued)

BAD:

# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])  
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])  
# Needs a space before <-
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"]) 
# Needs spaces around <-
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])  
# Needs a space after the comma
total <- sum(x[,1])  
# Needs a space after the comma, not before 
total <- sum(x[ ,1])  

Spacing (continued)

Place a space before left parenthesis, except in a function call.

GOOD:

if (debug)

BAD:

if(debug)

Extra spacing

Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).

plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))

Do not place spaces around code in parentheses or square brackets.

Exception: Always place a space after a comma.

Extra spacing

GOOD:

if (debug)
x[1, ]

BAD:

if ( debug )  # No spaces around debug
x[1,]  # Needs a space after the comma 

In general…

  • Use common sense and BE CONSISTENT.

  • The point of having style guidelines is to have a common vocabulary of coding

    • so people can concentrate on what you are saying, rather than on how you are saying it.
  • If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.

Basic data manipulation

The cats data

Description (see ?cats): The heart weights (in g) and body weights (in kg) of 144 adult cats (male and female) used for digitalis experiments. Digitalis is a medication used to treat heart failure and certain types of irregular heartbeat.

head(cats)
  Sex Bwt Hwt
1   F 2.0 7.0
2   F 2.0 7.4
3   F 2.0 9.5
4   F 2.1 7.2
5   F 2.1 7.3
6   F 2.1 7.6
str(cats)
'data.frame':   144 obs. of  3 variables:
 $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

The cats data

summary(cats)
 Sex         Bwt             Hwt       
 F:47   Min.   :2.000   Min.   : 6.30  
 M:97   1st Qu.:2.300   1st Qu.: 8.95  
        Median :2.700   Median :10.10  
        Mean   :2.724   Mean   :10.63  
        3rd Qu.:3.025   3rd Qu.:12.12  
        Max.   :3.900   Max.   :20.50  

How to get only Female cats?

# With dplyr::filter
fem.cats <- dplyr::filter(cats, Sex == "F")

dim(fem.cats)
[1] 47  3
# With base R:
# fem.cats <- cats[cats$Sex == "F", ]

How to get only heavy cats?

heavy.cats <- dplyr::filter(cats, Bwt > 3)
dim(heavy.cats)
[1] 36  3
head(heavy.cats)
  Sex Bwt  Hwt
1   M 3.1  9.9
2   M 3.1 11.5
3   M 3.1 12.1
4   M 3.1 12.5
5   M 3.1 13.0
6   M 3.1 14.3
# With base R:
# heavy.cats2 <- cats[cats$Bwt > 3, ]
# or:
# heavy.cats3 <- subset(cats, Bwt > 3)

Select cats with specified range of body weights

dplyr::filter(cats, Bwt > 2.0 & Bwt < 2.2, Sex == "F")
  Sex Bwt Hwt
1   F 2.1 7.2
2   F 2.1 7.3
3   F 2.1 7.6
4   F 2.1 8.1
5   F 2.1 8.2
6   F 2.1 8.3
7   F 2.1 8.5
8   F 2.1 8.7
9   F 2.1 9.8

Working with factors

class(cats$Sex)
[1] "factor"
levels(cats$Sex)
[1] "F" "M"

Working with factors

Changing the level names:

levels(cats$Sex) <- c("Female", "Male")

table(cats$Sex)

Female   Male 
    47     97 
head(cats)
     Sex Bwt Hwt
1 Female 2.0 7.0
2 Female 2.0 7.4
3 Female 2.0 9.5
4 Female 2.1 7.2
5 Female 2.1 7.3
6 Female 2.1 7.6

Sorting with dplyr::arrange()

dplyr::arrange()sorts the rows of a data frame according to the order of one or more of the columns.

# sort cats by body weight (Bwt) from low to high values
sorted.cats <- dplyr::arrange(cats, Bwt)
head(sorted.cats)
     Sex Bwt Hwt
1 Female 2.0 7.0
2 Female 2.0 7.4
3 Female 2.0 9.5
4   Male 2.0 6.5
5   Male 2.0 6.5
6 Female 2.1 7.2

Sorting with dplyr::arrange()

dplyr::arrange()sorts the rows of a data frame according to the order of one or more of the columns.

# Sort the cats by Sex first and within Sex (1=female) by body weight (low to high)
sorted.cats.grouped <- dplyr::arrange(cats, Sex, Bwt)
head(sorted.cats.grouped)
     Sex Bwt Hwt
1 Female 2.0 7.0
2 Female 2.0 7.4
3 Female 2.0 9.5
4 Female 2.1 7.2
5 Female 2.1 7.3
6 Female 2.1 7.6

Combining matrices or dataframes

# add column with id numbers
ID <- matrix(c(1:144), nrow=144, ncol=1)
# add variable name for ID variable and change names of other variables
D <- dplyr::bind_cols(ID, cats)
names(D) <- c("ID", "Sex", "Bodyweigt", "Hartweight")
head(D)
  ID    Sex Bodyweigt Hartweight
1  1 Female       2.0        7.0
2  2 Female       2.0        7.4
3  3 Female       2.0        9.5
4  4 Female       2.1        7.2
5  5 Female       2.1        7.3
6  6 Female       2.1        7.6

Basic analysis

Correlation

cor(cats[, -1])
          Bwt       Hwt
Bwt 1.0000000 0.8041274
Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column (female/male)

Correlation

cor.test(cats$Bwt, cats$Hwt)

    Pearson's product-moment correlation

data:  cats$Bwt and cats$Hwt
t = 16.119, df = 142, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7375682 0.8552122
sample estimates:
      cor 
0.8041274 

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)

    Welch Two Sample t-test

data:  Hwt by Sex
t = -6.5179, df = 140.61, p-value = 1.186e-09
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -2.763753 -1.477352
sample estimates:
mean in group Female   mean in group Male 
            9.202128            11.322680 

T-test

plot(formula = Hwt ~ Sex, data = cats)

Linear regression

Remember the boys data from package mice.

If we want to regress BMI (bmi) on weight (wgt), we would use the lm() function and the model formula bmi ~ wgt.

lm(formula = bmi ~ wgt, data = boys)

Call:
lm(formula = bmi ~ wgt, data = boys)

Coefficients:
(Intercept)          wgt  
    14.5401       0.0935  

Scatterplot

plot(formula = bmi ~ wgt, data = boys)

Add a quadratic term

lm(formula = bmi ~ wgt + I(wgt^2), data = boys)

Call:
lm(formula = bmi ~ wgt + I(wgt^2), data = boys)

Coefficients:
(Intercept)          wgt     I(wgt^2)  
  16.271466    -0.041715     0.001604  

Why we need I()

Inside an R model formula, the +, :, and ^ operators are text objects and therefore behave differently than when used in calculations with numeric vectors. Using I() converts these operators to their numerical meaning.

lm(bmi ~ wgt + wgt^2, data = boys)

Call:
lm(formula = bmi ~ wgt + wgt^2, data = boys)

Coefficients:
(Intercept)          wgt  
    14.5401       0.0935  
lm(bmi ~ wgt + I(wgt^2), data = boys)

Call:
lm(formula = bmi ~ wgt + I(wgt^2), data = boys)

Coefficients:
(Intercept)          wgt     I(wgt^2)  
  16.271466    -0.041715     0.001604  

Output of the lm() class

It is ‘nicer’ to store the output from the function in an object. The convention for regression models is an object called fit.

fit <- lm(bmi ~ wgt + I(wgt^2), data = boys)

The object fit contains a lot more than just the regression weights. To inspect what is inside you can use

ls(fit)
 [1] "assign"        "call"          "coefficients"  "df.residual"  
 [5] "effects"       "fitted.values" "model"         "na.action"    
 [9] "qr"            "rank"          "residuals"     "terms"        
[13] "xlevels"      

Inspecting what is inside fit

Another approach to inspecting the contents of fit is the function attributes()

attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"        

$class
[1] "lm"

The benefit of using attributes() is that it directly tells you the class of the object.

Classes in R

class(fit)
[1] "lm"

Classes are used for an object-oriented style of programming. This means that you can write a specific function that

  • has fixed requirements with respect to the input.
  • presents output or graphs in a predefined manner.

When a generic function fun is applied to an object with class attribute c("first", "second"), the system searches for a function called fun.first and, if it finds it, applies it to the object.

If no such function is found, a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.

Classes example: plotting without class

plot(bmi ~ wgt, data = boys)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 1)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 2)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 4)

Why is plot different for class "lm"?

The function plot() is called, but not used. Instead, because the linear model has class "lm", R searches for the function plot.lm().

If function plot.lm() would not exist, R tries to apply function plot() (which would have failed in this case because plot requires x and y as input)

plot.lm() is created by John Maindonald and Martin Maechler. They thought it would be useful to have a standard plotting environment for objects with class "lm".

Since the elements that class "lm" returns are known, creating a generic function class is straightforward.

Pipes

This is a pipe:

boys <- 
  read_sav("files/boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like

boys  <- read_sav("files/boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("files/boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: a single object in memory, the steps are easy to follow and understand.

With pipes

Your code becomes more readable:

  • data operations are structured from left-to-right and not from in-to-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

Keyboard shortcut:

  • Windows/Linux: ctrl + shift + m
  • Mac: cmd + shift + m

What do pipes do:

What do pipes do:

Pipes create functions without nesting:

f(x) becomes x %>% f()

set.seed(123)
mean(rnorm(10))
[1] 0.07462564
set.seed(123)
rnorm(10) %>% 
  mean()
[1] 0.07462564

Pipes create functions without nesting

f(x, y) becomes x %>% f(y)

cor(boys, use = "pairwise.complete.obs")
          hgt       wgt       bmi
hgt 1.0000000 0.6100784 0.1758781
wgt 0.6100784 1.0000000 0.8841304
bmi 0.1758781 0.8841304 1.0000000
boys %>% 
  cor(use = "pairwise.complete.obs")
          hgt       wgt       bmi
hgt 1.0000000 0.6100784 0.1758781
wgt 0.6100784 1.0000000 0.8841304
bmi 0.1758781 0.8841304 1.0000000

Pipes create functions without nesting

h(g(f(x))) becomes x %>% f %>% g %>% h

max(na.omit(subset(boys, select = wgt)))
[1] 117.4
boys %>% 
  subset(select = wgt) %>% 
  na.omit() %>% 
  max()
[1] 117.4

Useful: outlier filtering

nrow(cats)
[1] 144
# select the observations that fall outside the +/- SD around the mean
cats.outl <- 
  cats %>% 
  dplyr::filter(Hwt > mean(Hwt) + 3 * sd(Hwt) | Hwt < mean(Hwt) - 3 * sd(Hwt))

nrow(cats.outl)
[1] 1
# select the observations that fall within +/- 3 SD around the mean
cats.without.outl <- 
  cats %>% 
  dplyr::filter(Hwt < mean(Hwt) + 3 * sd(Hwt) & Hwt > mean(Hwt) - 3 * sd(Hwt))

nrow(cats.without.outl)
[1] 143

More pipe stuff

Pipes and the R model formula ~

Sometimes the data we want to use, are “piped” in the wrong argument, see e.g.:

cats %>% 
  lm(Hwt ~ Bwt)

This leads to an error:

Error in as.data.frame.default(data) :

cannot coerce class ‘“formula”’ to a data.frame

The %>% pipe operator is intended to work with functions where each result is forwarded on to the first argument of the next function. For the function lm() the first argument is the model formula, which is a text object and not the data frame.

Solution 1

We can use the . symbol to act as placeholder for the data:

cats %>% 
  lm(Hwt ~ Bwt, data = .)

Call:
lm(formula = Hwt ~ Bwt, data = .)

Coefficients:
(Intercept)          Bwt  
    -0.3567       4.0341  

Solution 2: the exposition %$% pipe

Solution 2: the exposition %$% pipe

The exposition %$% pipe exposes the content(s) of the data frame (the variables) to the next function in the pipeline.

# Now the code works without need of the placeholder

cats %$% 
  lm(Hwt ~ Bwt)

Call:
lm(formula = Hwt ~ Bwt)

Coefficients:
(Intercept)          Bwt  
    -0.3567       4.0341  

Debugging pipelines

Sample 3 positions from the alphabet and show the position and the letter. If you don’t know what’s going on, run each statement separately!

set.seed(123)
1:26
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
set.seed(123)
1:26 %>% 
  sample(3)
[1] 15 19 14
set.seed(123)
1:26 %>%
  sample(3) %>%
  paste(., LETTERS[.])
[1] "15 O" "19 S" "14 N"

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)

    Welch Two Sample t-test

data:  Hwt by Sex
t = -6.5179, df = 140.61, p-value = 1.186e-09
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -2.763753 -1.477352
sample estimates:
mean in group Female   mean in group Male 
            9.202128            11.322680 

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- 
  cats %$%
  t.test(Bwt ~ Sex)

cats.test

    Welch Two Sample t-test

data:  Bwt by Sex
t = -8.7095, df = 136.84, p-value = 8.831e-15
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.6631268 -0.4177242
sample estimates:
mean in group Female   mean in group Male 
            2.359574             2.900000